library(RColorBrewer)  # color palettes
library(mvtnorm)       # sampling from the multivariate normal
library(latex2exp)     # LaTeX expressions in plot labels
library(car)           # ellipse()
cols <- brewer.pal(9, "Set1")
What is the best representation of the data in 0 dimensions? Of course it is the sample mean: it minimizes the residual sum of squares (the sum of squared differences between the points and the mean).
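Indeed, for any candidate \(\mu\) the residual sum of squares decomposes as
\[
\sum_{i=1}^n \|x_i - \mu\|^2 \;=\; \sum_{i=1}^n \|x_i - \bar{x}\|^2 + n\,\|\bar{x} - \mu\|^2,
\]
which is minimized exactly at \(\mu = \bar{x}\).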
n <- 100                                    # sample size
mu <- c(1,2)                                # true mean
sig <- matrix(c(1,1,1,4), 2, 2, byrow = T)  # true covariance matrix
X <- rmvnorm(n, mu, sig)                    # simulate bivariate normal data
par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
segments(x0 = X[,1], y0 = X[,2], x1 = colMeans(X)[1], y1 = colMeans(X)[2])
points(colMeans(X)[1], colMeans(X)[2], col = cols[1], pch = 17)
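As a quick numerical sanity check (a minimal sketch; the helper rss and the perturbation below are just illustrative), the residual sum of squares at the sample mean is smaller than at any shifted candidate:
rss <- function(m) sum(sweep(X, 2, m)^2)  # sum of squared distances from each point to m
rss(colMeans(X))              # RSS at the sample mean
rss(colMeans(X) + c(0.5, 0))  # RSS at a perturbed point: always larger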
What is the best representation of the data in 1 dimension? It is the line that captures most of the variability of the data!
M <- colMeans(X)  # sample mean
S <- cov(X)       # sample covariance matrix
par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
points(colMeans(X)[1], colMeans(X)[2], col = cols[1], pch = 17)
ellipse(M, S, 2.5, add = T, col = cols[1], lty = 2)
# Compute the principal components
PC <- prcomp(X, scale. = FALSE)
# The loadings (columns of PC$rotation) are the directions that define the PCs;
# project each point onto the first PC direction
PC1 <- NULL
for(i in 1:n){
  PC1 <- rbind(PC1, PC$rotation[,1] * PC$x[i,1] + M)
}
par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
points(PC1, col = cols[1], pch = 16, cex = 0.75)
segments(x0 = X[,1], y0 = X[,2], x1 = PC1[,1], y1 = PC1[,2], col = cols[1])
points(X, pch = 16)
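To check that the first PC is indeed the best one-dimensional fit, we can compare the orthogonal residual sum of squares of the PC1 projection with that of any other unit direction (a minimal sketch; the helper proj_res and the alternative direction are just illustrative):
Xc <- scale(X, center = M, scale = FALSE)  # centered data
proj_res <- function(v){
  v <- v / sqrt(sum(v^2))  # normalize the direction
  P <- Xc %*% v %*% t(v)   # orthogonal projection of each point onto v
  sum((Xc - P)^2)          # residual sum of squares
}
proj_res(PC$rotation[,1])  # residuals of the PC1 fit: the smallest achievable
proj_res(c(1, 0))          # residuals along an arbitrary direction: larger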
par(mar=c(4,4,2,2), family = 'serif')
plot(X, asp = 1, pch = 16, xlab = TeX("$X_1$"), ylab = TeX("$X_2$"))
# Draw the (rescaled) PC directions starting from the sample mean
arrows(x0 = M[1], y0 = M[2], x1 = M[1] + 5*PC$rotation[1,], y1 = M[2] + 5*PC$rotation[2,],
       col = 'red', length = 0.1)
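Two quick properties worth verifying (simple checks on objects already defined above): the loading matrix is orthonormal, and the PC variances are the eigenvalues of the sample covariance matrix.
round(t(PC$rotation) %*% PC$rotation, 10)  # should be the 2 x 2 identity matrix
PC$sdev^2         # variances of the PCs...
eigen(S)$values   # ...coincide with the eigenvalues of cov(X)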
Let us now load data on food consumption for several European countries.
rm(list=ls())
cols <- brewer.pal(9, "Set1")
food <- read.table('Food.txt', header = T)
scale_food <- as.matrix(scale(food))
p <- ncol(food)
plot(food, pch = 16)
Each point represents a country. It is hard to visualize such high-dimensional data and to grasp the relationships among variables visually.
For that, we can use PCA.
pc_fit <- prcomp(food, scale. = TRUE)
scores_fit <- pc_fit$x
load_fit <- pc_fit$rotation
The object pc_fit$x is an \(n \times p\) matrix containing the scores of the data, whereas pc_fit$rotation is the rotation matrix: a \(p \times p\) matrix whose columns are the PC loadings (directions).
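In fact, the scores are nothing but the standardized data multiplied by the rotation matrix; a quick check using the scale_food matrix computed above:
max(abs(scale_food %*% load_fit - scores_fit))  # numerically zero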
Let’s show the proportion of variance explained by each PC.
layout(matrix(c(2,3,1,3), 2, byrow = T))
barplot(pc_fit$sdev^2, las = 2, main = 'Principal Components', ylim = c(0,3),
        ylab = 'Variances')
plot(cumsum(pc_fit$sdev^2)/sum(pc_fit$sdev^2), type = 'b', axes = F,
     xlab = 'n° components', ylab = '% explained variance', ylim = c(0,1))
abline(h = 1, col = 'blue')
abline(h = 0.8, lty = 2, col = 'blue')
box()
axis(2, at = 0:10/10, labels = 0:10/10)
axis(1, at = 1:p, labels = 1:p, las = 2)
barplot(apply(scale_food, 2, sd)^2, las = 2, main = 'Original Variables',
        ylab = 'Variances')
We can see that the variances of the 5th, 6th, 7th and 8th PCs are very small: they are not useful to explain the data.
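We can make the choice quantitative, e.g. by computing the smallest number of PCs that explains at least \(80\%\) of the total variance (the same \(0.8\) threshold drawn as a dashed line in the plot above):
prop_var <- cumsum(pc_fit$sdev^2) / sum(pc_fit$sdev^2)  # cumulative proportion of variance
min(which(prop_var >= 0.8))                             # number of PCs to keep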
We can now show the loading plot, which will help us interpret the PCs.
par(mar = c(2,2,2,0), mfrow = c(4,1))
for(i in 1:4){
  barplot(load_fit[,i], ylim = c(-1, 1))
  abline(h = 0)
}
Remember that these are the directions of the PCs. The first PC, which explains about \(50\%\) of the variability, seems to highlight a contrast (some loadings are positive, some negative) between the consumption of eggs, milk, meat and pork meat on one side and cereals and pulses on the other: a contrast between livestock and agricultural goods.
Fruits and fish are excluded: they contribute very little to the first PC and are more important in the second one, where there seems to be a contrast between fish and fruits and the other variables.
Let us now visualize the score plot: a plot of the data in the space of the first two PCs (in this case). We can also add the original axes to see how the PCs relate to them.
par(mar=c(4,4,2,2), family = 'serif')
biplot(pc_fit, scale = 0, cex = 0.8)
The interpretation is clear: Mediterranean countries consume pulses, fruits and fish; Eastern European countries consume cereals. The other countries mostly consume meat and dairy products.
Remember the iris dataset? Let's pretend for a second that we don't know the species (the label coloring).
rm(list=ls())
cols <- brewer.pal(9, "Set1")
data <- as.matrix(iris[sample(1:nrow(iris), nrow(iris)), 1:4])  # shuffle the rows and drop the species column
par(mar=c(4,4,2,2), family = 'serif')
pairs(data, pch = 16)
How many groups do you see?
iris_dist <- dist(data, method='euclidean')
iris_sin <- hclust(iris_dist, method = 'single')
iris_ave <- hclust(iris_dist, method = 'average')
iris_com <- hclust(iris_dist, method = 'complete')
par(mar=c(4,4,2,2), mfrow=c(1,3), family = 'serif')
plot(iris_sin, main='euclidean-single', hang=-0.1, xlab='', labels=F, cex=0.6, sub='')
plot(iris_ave, main='euclidean-average', hang=-0.1, xlab='', labels=F, cex=0.6, sub='')
plot(iris_com, main='euclidean-complete', hang=-0.1, xlab='', labels=F, cex=0.6, sub='')
Dendrograms: a tree representation of the nested clustering structure. Long vertical branches indicate that the corresponding clusters are robust (well separated).
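One way to compare the linkages numerically (a small sketch using cophenetic from base stats) is the cophenetic correlation, i.e. how well the dendrogram distances agree with the original pairwise distances:
cor(iris_dist, cophenetic(iris_sin))  # single linkage
cor(iris_dist, cophenetic(iris_ave))  # average linkage
cor(iris_dist, cophenetic(iris_com))  # complete linkage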
labels_ave <- cutree(iris_ave, k = 3)
labels_ave
104 145 5 4 120 33 77 75 34 28 117 114 147 43 60 71 84 146 132 47
1 1 2 2 3 2 3 3 2 2 1 3 3 2 3 3 3 1 1 2
133 88 7 96 112 36 103 3 39 15 99 130 100 6 142 127 86 107 70 19
1 3 2 3 1 2 1 2 2 2 3 1 3 2 1 3 3 3 3 2
124 102 26 128 76 49 12 138 14 40 30 85 27 41 18 106 24 50 46 143
3 3 2 3 3 2 2 1 2 2 2 3 2 2 2 1 2 2 2 3
113 129 95 57 56 136 122 98 94 118 48 123 42 149 8 69 83 92 23 11
1 1 3 3 3 1 3 3 3 1 2 1 2 1 2 3 3 3 2 2
45 116 29 44 87 115 55 82 53 91 64 110 111 93 10 25 32 121 139 105
2 1 2 2 3 3 3 3 3 3 3 1 1 3 2 2 2 1 3 1
17 140 126 9 74 65 135 59 81 38 150 73 137 52 51 109 1 16 67 20
2 1 1 2 3 3 1 3 3 2 3 3 1 3 3 1 2 2 3 2
63 78 141 61 22 37 2 21 101 68 35 108 58 90 131 31 72 119 148 79
3 3 1 3 2 2 2 2 1 3 2 1 3 3 1 2 3 1 1 3
66 97 13 134 80 144 62 54 89 125
3 3 2 3 3 1 3 3 3 1
par(mar=c(4,4,2,2), family = 'serif')
pairs(data, pch = 16, col = cols[labels_ave])
Let us try k-means with \(k = 3\).
set.seed(1234)
fit <- kmeans(data, centers = 3)
par(mar=c(4,4,2,2), family = 'serif')
pairs(data, pch = 16, col = cols[fit$cluster])
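We can also cross-tabulate the k-means partition with the hierarchical one obtained above; if the two methods agree, most of the mass concentrates on one cell per row (cluster labels are arbitrary, so the agreement need not lie on the diagonal):
table(hierarchical = labels_ave, kmeans = fit$cluster)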
A good heuristic in practice to find a good value of \(k\) is to see when the within-cluster sum of squared distances (the within-cluster variability) stops decreasing drastically, i.e. the "elbow" of the curve.
K <- 20
WCV <- array(NA, dim = K)
for (k in 1:K){
  fit <- kmeans(data, centers = k)
  WCV[k] <- sum(fit$withinss)
}
plot(1:K, WCV, pch = 16)
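Note that kmeans depends on its random initialization, so the curve can change slightly between runs; a more stable variant (a sketch, with nstart = 20 as an arbitrary choice) restarts the algorithm several times per k and keeps the best solution:
set.seed(1234)
WCV_stable <- sapply(1:K, function(k) kmeans(data, centers = k, nstart = 20)$tot.withinss)
plot(1:K, WCV_stable, pch = 16)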